---
title: "K-means Clustering"
date: "2023-02-24"
output:
  html_document:
    toc: yes
    keep_md: yes
  pdf_document:
    toc: yes
  word_document:
    toc: yes
---
Customer segmentation is one of the most important applications of unsupervised learning. Using clustering techniques, companies can identify distinct segments of customers, allowing them to target their potential user base more effectively. In this machine learning project, we will use k-means clustering, the essential algorithm for clustering an unlabeled dataset.
Customer segmentation is the process of dividing a customer base into groups of individuals that are similar in ways relevant to marketing, such as gender, age, interests, and spending habits.
In the first step of this data science project, we will perform data exploration: we import the essential packages, read in the data, and then go through it to gain the necessary insights.
install.packages('readxl')
install.packages('tidyverse')
install.packages('janitor')
install.packages('plotly')
install.packages('plotrix')
library(readxl)
library(dplyr)
library(janitor)
library(ggplot2)
library(plotly)
library(plotrix)
library(purrr)
df_customerData <- read_xlsx('Worksheet in Lab Assessment.xlsx')
Look for missing values in the data and remove them. Note that is.null() only checks whether an object itself is NULL; anyNA() is the correct check for missing values.
anyNA(df_customerData)
## [1] FALSE
na.omit(df_customerData)
## # A tibble: 200 × 5
## CustomerID Gender Age `Annual Income (k$)` `Spending Score (1-100)`
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
## 7 7 Female 35 18 6
## 8 8 Female 23 18 94
## 9 9 Male 64 19 3
## 10 10 Female 30 19 72
## # … with 190 more rows
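Note that na.omit() returns a modified copy rather than changing the data in place, so the result must be assigned back for rows to actually be dropped. A minimal sketch on a toy data frame (the column names here are illustrative, not the customer data):

```r
# Toy data frame with one missing value
toy <- data.frame(age = c(19, NA, 20), income = c(15, 15, 16))

colSums(is.na(toy))   # count missing values per column: age 1, income 0

toy <- na.omit(toy)   # assign back, otherwise the NA rows are kept
nrow(toy)             # 2 rows remain
```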
df_customerData <- df_customerData |>
clean_names()
Start the EDA by checking the structure and dimensions of the dataset to better understand the data.
dim(df_customerData)
## [1] 200 5
str(df_customerData)
## tibble [200 × 5] (S3: tbl_df/tbl/data.frame)
## $ customer_id : num [1:200] 1 2 3 4 5 6 7 8 9 10 ...
## $ gender : chr [1:200] "Male" "Male" "Female" "Female" ...
## $ age : num [1:200] 19 21 20 23 31 22 35 23 64 30 ...
## $ annual_income_k : num [1:200] 15 15 16 16 17 17 18 18 19 19 ...
## $ spending_score_1_100: num [1:200] 39 81 6 77 40 76 6 94 3 72 ...
names(df_customerData)
## [1] "customer_id" "gender" "age"
## [4] "annual_income_k" "spending_score_1_100"
head(df_customerData)
## # A tibble: 6 × 5
## customer_id gender age annual_income_k spending_score_1_100
## <dbl> <chr> <dbl> <dbl> <dbl>
## 1 1 Male 19 15 39
## 2 2 Male 21 15 81
## 3 3 Female 20 16 6
## 4 4 Female 23 16 77
## 5 5 Female 31 17 40
## 6 6 Female 22 17 76
summary(df_customerData)
## customer_id gender age annual_income_k
## Min. : 1.00 Length:200 Min. :18.00 Min. : 15.00
## 1st Qu.: 50.75 Class :character 1st Qu.:28.75 1st Qu.: 41.50
## Median :100.50 Mode :character Median :36.00 Median : 61.50
## Mean :100.50 Mean :38.85 Mean : 60.56
## 3rd Qu.:150.25 3rd Qu.:49.00 3rd Qu.: 78.00
## Max. :200.00 Max. :70.00 Max. :137.00
## spending_score_1_100
## Min. : 1.00
## 1st Qu.:34.75
## Median :50.00
## Mean :50.20
## 3rd Qu.:73.00
## Max. :99.00
We will now display the standard deviation of the variable "age" using the sd() function and summarise it with summary().
summary(df_customerData$age)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 18.00 28.75 36.00 38.85 49.00 70.00
sd(df_customerData$age)
## [1] 13.96901
To see the age distribution, we visualise how the ages of customers are spread across the dataset.
ggplotly(ggplot(df_customerData,aes(age))+
geom_histogram(binwidth = 4,col = '#000000',fill= '#0099F8')+
stat_bin(binwidth = 4,geom = 'text',color = '#000000',
aes(label = ..count..,vjust = 0.9))+
labs(x= "Age Groups", y = "Frequency",
title = "Age Frequency Distribution Histogram")+
theme(axis.text.x = element_text(colour = 'black'),
axis.text.y = element_text(colour = 'black'),
axis.title = element_text(color ='#000000',
family = "Century Gothic", face = 'plain'),
title = element_text(face = "bold.italic",family = "Century Gothic")))
ggplotly(ggplot(df_customerData,aes(gender,age,fill = gender))+
geom_boxplot()+
labs(x= "Gender", y = "Age",
title = "Boxplot of Age by Gender")+
theme(axis.text.x = element_text(colour = 'black'),
axis.text.y = element_text(colour = 'black'),
axis.title = element_text(color ='#000000',
family = "Century Gothic", face = 'plain'),
title = element_text(face = "plain",family = "Century Gothic"))+
scale_fill_discrete(name = "Gender"))
ggplot(df_customerData,aes(age,fill = gender))+
geom_histogram(binwidth = 4,color = 'white')+
stat_bin(binwidth = 4,geom = 'text',
aes(label = ..count..,vjust = 1.5))+
labs(x= "Age Groups", y = "Frequency",
title = "Age Distribution by Gender")+
theme(axis.text.x = element_text(colour = 'black'),
axis.text.y = element_text(colour = 'black'),
axis.title = element_text(color ='#000000',
family = "Century Gothic", face = 'plain'),
title = element_text(face = "plain",family = "Century Gothic"))+
scale_fill_discrete(name = "Gender")
a <- table(df_customerData$gender)
pct <- round(a / sum(a) * 100)
lbs <- paste0(names(a), " ", pct, "%")
pie3D(a,labels=lbs,
main="Pie Chart Depicting Ratio of Female and Male")
ggplotly(ggplot(df_customerData,aes(gender,fill = gender))+
geom_bar()+
labs(x= "Gender", y = "Counts",
title = "No. of Males and Females")+
theme(axis.text.x = element_text(colour = 'black'),
axis.text.y = element_text(colour = 'black'),
axis.title = element_text(color ='#000000',
family = "Century Gothic", face = 'plain'),
title = element_text(face = "plain",family = "Century Gothic"))+
scale_fill_discrete(name = "Gender"))
From the above graph, we conclude that females make up 56% of the customer dataset, whereas males make up 44%.
summary(df_customerData$annual_income_k)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 15.00 41.50 61.50 60.56 78.00 137.00
sd(df_customerData$annual_income_k)
## [1] 26.26472
#Histogram for Annual Income
hist(df_customerData$annual_income_k,
col="#660033",
main="Histogram for Annual Income",
xlab="Annual Income Class",
ylab="Frequency",
labels=TRUE)
From the above descriptive analysis, we conclude that the minimum annual income of the customers is 15 (k$) and the maximum is 137. Customers earning around 70 have the highest frequency count in the histogram distribution. The average income of all the customers is 60.56.
ggplot(df_customerData,aes(annual_income_k,fill = gender))+
geom_density(col = 'black',alpha = 0.2)+
facet_wrap(~gender)+xlim(0,150)
The density plot shows that annual incomes are approximately normally distributed for both males and females.
summary(df_customerData$spending_score_1_100)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.00 34.75 50.00 50.20 73.00 99.00
ggplot(df_customerData,aes(spending_score_1_100))+
geom_histogram(binwidth = 10,col = '#000000',fill= '#0099F8')+
stat_bin(binwidth = 10,geom = 'text',color = '#000000',
aes(label = ..count..,vjust = -0.5))+
labs(x= "Spending Scores", y = "Frequency",
title = "Spending Scores Distribution Histogram")+theme_bw()
About 21% of customers have a spending score around 50; the minimum score is 1, the maximum is 99, and the average is 50.20. Approximately 33% of customers fall in the 40-50 spending-score class.
When using the k-means clustering algorithm, the first step is to specify the number of clusters (k) we wish to produce in the final output. The algorithm starts by randomly selecting k objects from the dataset to serve as the initial cluster centers; these are the cluster means, also known as centroids. Each remaining object is then assigned to the closest centroid, as measured by the Euclidean distance between the object and the cluster mean; this step is called "cluster assignment". Once assignment is complete, the algorithm recalculates the mean of each cluster. After the centers are updated, each observation is checked again and, if it is now closer to a different centroid, it is reassigned using the updated cluster means. These two steps repeat over several iterations until the cluster assignments stop changing, i.e. the clusters obtained in the current iteration are the same as those of the previous iteration.
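The procedure above can be sketched directly in base R. This is an illustrative re-implementation of Lloyd's algorithm for clarity only, not the kmeans() call used later (empty clusters are not handled):

```r
# Minimal Lloyd's algorithm: x is a numeric matrix, k the number of clusters
lloyd_kmeans <- function(x, k, max_iter = 100,
                         centers = x[sample(nrow(x), k), , drop = FALSE]) {
  assignment <- rep(0L, nrow(x))
  for (i in seq_len(max_iter)) {
    # Cluster assignment step: each point goes to its nearest centroid
    d <- as.matrix(dist(rbind(centers, x)))[-(1:k), 1:k, drop = FALSE]
    new_assignment <- max.col(-d, ties.method = "first")
    if (all(new_assignment == assignment)) break  # assignments stopped changing
    assignment <- new_assignment
    # Update step: recompute each centroid as the mean of its cluster
    for (j in seq_len(k)) {
      centers[j, ] <- colMeans(x[assignment == j, , drop = FALSE])
    }
  }
  list(cluster = assignment, centers = centers)
}
```

Calling lloyd_kmeans(as.matrix(df_customerData[, 3:5]), 4) would mirror the clustering performed later, up to the choice of random initial centroids.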
While working with clusters, you need to specify the number of clusters to use, ideally the optimal one. Three popular methods can help determine the optimal number of clusters –
Elbow method
Silhouette method
Gap statistic
The main goal behind cluster partitioning methods like k-means is to define the clusters such that the intra-cluster variation is minimized:
minimize( Σ_{k=1..K} W(C_k) )
where C_k represents the kth cluster and W(C_k) denotes its intra-cluster variation. By measuring the total intra-cluster variation, we can evaluate the compactness of the clusters. We then define the optimal number of clusters as follows –
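Concretely, W(C_k) is the sum of squared distances from each point in cluster k to the cluster mean; computed by hand on toy data it agrees with the tot.withinss value that kmeans() reports:

```r
set.seed(123)
x <- matrix(rnorm(100), ncol = 2)          # toy data: 50 points in 2-D
km <- kmeans(x, centers = 3, nstart = 25)

# W(C_k) for each cluster: squared distances to the cluster mean, summed
wss <- sapply(1:3, function(k) {
  pts <- x[km$cluster == k, , drop = FALSE]
  sum(sweep(pts, 2, colMeans(pts))^2)
})
sum(wss)   # should agree with km$tot.withinss up to floating-point error
```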
First, we run the clustering algorithm for several values of k, here varying k from 1 to 10, and for each k compute the total intra-cluster sum of squares (iss). We then plot iss against the number of clusters k; the location of a bend (or "knee") in this plot indicates the appropriate number of clusters for our model. Let us implement this in R as follows.
set.seed(123)
# function to calculate total intra-cluster sum of square
iss <- function(k) {
kmeans(df_customerData[,3:5],k,iter.max=100,nstart=100,algorithm="Lloyd" )$tot.withinss
}
k.values <- 1:10
iss_values <- map_dbl(k.values, iss)
plot(k.values, iss_values,
type="b", pch = 19, frame = FALSE,
xlab="Number of clusters K",
ylab="Total intra-clusters sum of squares")
From the above graph, we conclude that 4 is the appropriate number of clusters, since this is where the bend of the elbow appears.
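As a cross-check, the silhouette method listed earlier can be sketched with the cluster package (assuming it is installed; silhouette() is a real function in that package, while avg_silhouette is a helper defined here):

```r
library(cluster)  # for silhouette()

# Average silhouette width for a given k (x: numeric matrix or data frame)
avg_silhouette <- function(x, k) {
  km <- kmeans(x, centers = k, iter.max = 100, nstart = 25)
  mean(silhouette(km$cluster, dist(x))[, "sil_width"])
}
```

For example, sapply(2:10, function(k) avg_silhouette(df_customerData[, 3:5], k)) gives one score per candidate k; the k with the largest average silhouette width is preferred.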
pcclust=prcomp(df_customerData[,3:5],scale=FALSE) #principal component analysis
summary(pcclust)
## Importance of components:
## PC1 PC2 PC3
## Standard deviation 26.4625 26.1597 12.9317
## Proportion of Variance 0.4512 0.4410 0.1078
## Cumulative Proportion 0.4512 0.8922 1.0000
pcclust$rotation[,1:2]
## PC1 PC2
## age 0.1889742 -0.1309652
## annual_income_k -0.5886410 -0.8083757
## spending_score_1_100 -0.7859965 0.5739136
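The rotation matrix above holds the loadings; the scores in pcclust$x are just the centred data multiplied by these loadings. A quick check on toy data (independent of the customer dataset):

```r
set.seed(42)
x <- matrix(rnorm(60), ncol = 3)   # toy data: 20 rows, 3 variables
pca <- prcomp(x, scale = FALSE)    # centred, unscaled PCA, as above

# Reconstruct the scores: centre the data, then project onto the loadings
scores <- scale(x, center = pca$center, scale = FALSE) %*% pca$rotation
all.equal(scores, pca$x, check.attributes = FALSE)   # TRUE
```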
k4 <- kmeans(df_customerData[,3:5],4,iter.max=100,nstart=50,algorithm="Lloyd")
set.seed(1)
ggplotly(ggplot(df_customerData, aes(x =annual_income_k, y = spending_score_1_100)) +
geom_point(stat = "identity", aes(color = as.factor(k4$cluster))) +
scale_color_discrete(name=" ",
breaks=c("1", "2", "3", "4"),
labels=c("Cluster 1", "Cluster 2", "Cluster 3", "Cluster 4")) +
ggtitle("Segments of Mall Customers", subtitle = "Using K-means Clustering")+
labs(x = 'Annual Income Thousands ($)', y ='Spending score '))
From the above visualization, we observe how the four clusters separate customers along annual income and spending score: one cluster groups customers with both high income and high spending, another those with low income and low spending, while the remaining clusters capture customers whose income and spending levels fall in between or diverge (for example, high income with low spending). Note that the numeric cluster labels produced by kmeans() are arbitrary and may differ between runs.
kCols <- function(vec) {
cols <- rainbow(length(unique(vec)))
cols[as.numeric(as.factor(vec))]
}
digCluster <- k4$cluster            # k-means cluster assignments
dignm <- as.character(digCluster)
plot(pcclust$x[, 1:2], col = kCols(digCluster), pch = 19,
xlab = "PC1", ylab = "PC2")
legend("bottomleft", unique(dignm), fill = unique(kCols(digCluster)))
In the PCA projection, the four clusters separate mainly along the first two principal components. Since PC1 loads most heavily on spending score and annual income, and PC2 mostly on annual income, clusters that differ in income and spending occupy different regions of the PC1-PC2 plane. As before, the numeric cluster labels are arbitrary.
With the help of clustering, we can understand the variables much better, enabling more careful decisions. By identifying customer segments, companies can release products and services that target customers based on parameters such as income, age, and spending patterns. More complex signals, such as product reviews, can also be taken into account for better segmentation.